Abstract



Determining the cross-cultural equivalence of multilingual tests is a challenge that is more complex than simple 
horizontal equating of test forms. This study examines the functioning of a trilingual test of preschool readiness 
to determine the equivalence. Different forms of the test have previously been examined using classical 
statistical techniques on bilingual forms (Lang, Chew & Schomber, 1992). In this effort, the primary goal was 
to determine instrument characteristics and explore test language differences using the Rasch model of item 
analysis. 
Using the Rasch Model to Determine Equivalence of Forms
In the Trilingual Lollipop Readiness Test 

Multilingualism has been a problem in psychoeducational assessment for almost a century (Arsenian,
1937). A number of studies have described the difficulties of using standardized instruments developed in
English with limited-English-proficiency (LEP) or bilingual children (Figueroa, 1983;
Valencia & Rankin, 1985). Translations of the Lollipop Test of School Readiness into Spanish (Chew, 1989)
and French (Venet, Normandeau, Letarte, & Bigras, 2003) from the original English (Chew, 1981) may be
linguistically sound, but the question of cultural construct equivalence is also important to valid, reliable, and
fair assessment. Mardell-Czudnowski (1987) recommends that American-normed tests should be interpreted with
extreme caution when used with multilingual children.

The need for multilingual editions of a culturally-fair school readiness test was the motivation for the
development of the Lollipop Test in Spanish and French translations. In at least one study with the Lollipop Test,
equal total scores did not necessarily translate into construct equivalence for bilingual students (Vargas & Lang,
1994). With four- or five-year-old children who sometimes mix two languages in a multilingual culture, a
developmental readiness test needs to work at the item level and be as culturally-fair as possible for valid
readiness measurement.

Background of the Readiness Construct and Lollipop Test*

The concept of readiness can be related to achievement in various school subjects and at various levels of 
the curriculum. “Traditionally, however, readiness refers to the intellectual and social characteristics of 
kindergarten and first grade children. Readiness may be defined as a general responsiveness to instruction or 
as specific intellectual abilities that are predictive of the development of reading or of arithmetic skills” (Eeton 
& Rutter, 1973, p. 293). Readiness as defined by Ausubel (1959, p. 247) refers to “the adequacy of existing 
capacity in relation to the demands of a given learning task.” Generally, readiness refers to the capacity for 
meeting successfully certain expectancies or for achieving particular levels of performance (Brandt, 1971). 

Even if we think of readiness only as capacity, Brandt suggests that at least three different, though interrelated, 
kinds of readiness stand out as especially important. One is physical maturity, which has long been recognized 
as a precursor to jumping, skipping, bicycling and other gross motor activities of early childhood. Physical 
maturity is also related to various intellectual activities and school accomplishments. 

The second kind of readiness is social-emotional, having to do with developing personality qualities and
the strides already taken toward independence. Kindergarteners who are relatively outgoing and self-assured 
are apt to find the school world exciting and enjoyable. The dependent, withdrawn child, on the other hand, is 
at a different stage of development and initially needs the warm assurance and helping hand which good schools 
offer before other horizons can be fully developed. The third kind of readiness discussed by Brandt might be 
labeled intellectual-educational, referring to the fact that during the first years of life children have already had 
many experiences that shape their thinking and fill their minds. By age 6 their vocabulary averages well over 
2,500 words, although culturally deprived children may know only half of this number. In the Encyclopedia of
Educational Research, 4th edition, Tyler (1969) offers an in-depth, concise review of the literature on the
concept of readiness. In a similar earlier paper, Tyler (1964) presents a discussion of issues related to readiness
to learn. The idea of readiness has been and still is inherent in our thinking about education, although Tyler’s 
review suggests that there is considerable diversity of opinion about both practice and theory. Different 
theorists have different notions about the nature of child development. Some, for instance Gesell, consider that
it is a matter of maturation forces, experiences, formal and informal teaching, and equilibration, i.e., the
organization of knowledge occasioned by the appearance of contradictions in the child's system of beliefs. Still
others, e.g., learning psychologists, regard it as a matter of accumulating new behavior through learning from
experiences. More recent discussions still debate the exact components and theories of readiness, but there
remains a general consensus that readiness assessment is common and useful for preschool and readiness
programs (Kagan, 1990; Gredler, 1997).

* This section adapted from Chew, A. L. (in press). Developmental and Interpretive Manual for the Lollipop
Test of School Readiness Ed.). Humanics Psychological Test Corp.: Lake Worth, FL.

Instrument History 

The Lollipop Test consists of three components and 52 items: (1) a set of seven Stimulus Cards used for 
presentation of items throughout the test, (2) the Administration and Scoring Booklet, and (3) the 
Developmental and Interpretive Manual. The Stimulus Cards are used with the combination Administrative 
and Scoring Booklet. The test contains four sections or subtests: Test 1: Identification of colors and shapes, 
and copying shapes; Test 2: Picture description, position, and spatial recognition; Test 3: Identification of 
numbers, and counting; and Test 4: Identification of letters, and writing. Through several revisions, the 
Lollipop Test has been popular and demonstrated validity and reliability in the English edition with classical 
item analysis, longitudinal predictions, and comparative correlations with other instruments. This study is the 
first Differential Item Functioning (DIF) analysis of the Lollipop Test and the first trilingual scaling of the 
items. Research and classical test theory analyses from the last thirty years are summarized in Table 1. 

Table 1 

Chronological Summary of Research on the Lollipop Test 



AUTHOR(S)                     DATE    SIGNIFICANCE OF STUDY

Chew                          1977    Construct Validation & Factor Analysis
Chew                          1981    Lollipop Test Initially Published
Chew                          1983    Concurrent Validity
Chew & Morris                 1984    Concurrent Validity
Chew, Kesler, & Sudduth       1984    Developing Local Norms
Chew                          1985    Pre-Kindergarten Screening Application
Chew & Morris                 1987    Pre-Kindergarten Screening Application
Chew & Morris                 1989    Predictive Validity
Chew                          1989    Lollipop Test, Revised (2nd Edition)
Chew                          1989    Spanish Edition Published
Lang & Chew                   1989    Development of Head Start Norms
Lang & Chew                   1989    Norms Technical Manual; GA Head Start
Chew & Lang                   1990    Predictive Validity
Lang & Chew                   1990    Utilizing Lollipop Norms
Lang, Chew, & Schomber        1990    Construct Validity of Spanish Edition
Chew & Lang                   1992    Predictive Validity of Spanish Edition
Lang, Chew, & Schomber        1992    Regression Equating Spanish & English
Chew & Lang                   1993    Concurrent Validity of Spanish Edition
Lang, Chew, & Vargas          1993    Rasch Analysis of Spanish & English
Vargas & Lang                 1994    Cultural Fairness of the Spanish & English
Lang, Chew, & Vargas          1995    Equating the Spanish & English Editions
Lang, Chew, & Gill            1995    Longitudinal Predictive Validity
Eno & Woehlke                 1995    Predictive Validity
Lang & Chew                   1997    Predictive Validity
Chew                          2001    Head Start Norms
Venet, Normandeau, et al.     2003    Validation of French Translation
Chew                          2007    Lollipop Test-III (3rd Edition)

Note: See reference list for complete citations for publications listed above. 
Purpose of this study

Although multiple studies utilizing classical methods have been conducted on the Lollipop Test and each 
individual language translation of it, no studies prior to this one have equated the test results for children who 
are native speakers of the three different languages tested - English, French, and Spanish. The purpose of this 
study was to determine the equivalence of the three versions of the test by exploring item functioning
with the Rasch model (Rasch, 1980).



Method 

Analyses 

Modern test development methods now include vertical and horizontal equating. The specific
methodology for equating that incorporates the Rasch model is well established (Wolfe, 2004), but it applies
only partially to this effort, since there are no common persons to calibrate and item translation
equivalence is the issue. The Rasch model was deemed appropriate for this effort because the construct of the
instrument was considered established, but the need for sample-independent analysis was evident. The Rasch
model also met the requirements of analysis appropriate with different aggregations of items, handling of
missing data, and sensitivity to misfit from the construct.

This study used an exploratory analysis that started with overall model statistics and followed up as 
patterns were revealed, because a strict equating model utilizing common items or persons is impossible. 
Perhaps a better description of the problem would be to compare different forms of a test with no common 
persons and parallel, but not common items. The Rasch model was expected to be useful for overall 
consideration of dimensionality, reliability, separation, and fit. Applying the Rasch model started with
calibration of items and then examined the overall estimates of the model parameters (Smith, 2001). Item
calibration on a school readiness test has been demonstrated at least once previously (Gumpel, 1999).
Additionally, a recent cross-cultural comparison using the Rasch model has been effectively
demonstrated (Choi, Mericle, & Harachi, 2006). Those efforts guided the study reported here, even though the
planned analysis was admittedly investigative.
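
As a point of reference for the calibration described above, the following minimal sketch shows the dichotomous form of the Rasch model, in which the probability of success depends only on the difference between person ability and item difficulty in logits. The function name is hypothetical, and the dichotomous form is a simplifying assumption: the Winsteps output reported in this paper lists more response categories than items, so some items were evidently scored with more than two categories.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of success under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A child whose ability equals the item difficulty succeeds half the time.
p_equal = rasch_probability(0.0, 0.0)

# A child one logit above the item difficulty is expected to succeed more often.
p_above = rasch_probability(1.0, 0.0)
```

Because only the difference between the two parameters enters the formula, item difficulties can in principle be compared across samples that differ in ability, which is the property the study relies on.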

A statistic produced by the Rasch model is Differential Item Functioning (DIF), but DIF analysis 
requires both an appropriately modeled test and detailed knowledge of the construct (Bond & Fox, 2001). DIF 
was examined for each language form of the Lollipop Test separately and as a pooled set of items. DIF analyses
by the four subtests were performed, as were analyses by pretest scores, posttest scores, and combined
pretest-posttest scores. Particular attention was focused on examining patterns, item functioning, and contrasts that
could be interpreted as culturally loaded or indicating possible language bias. Analysis results were interpreted
within the framework of language expertise and construct knowledge. The choice of separate analysis of
pretest, posttest, or both sets of items is dependent on the frame of reference that is meaningful to the user's
assertions (Wright, 1996). The comparative language of testing was the primary variable of interest, and the
most common use of the test is for pretest-posttest gains utilizing the four subtests; therefore, that was the focus
of the researchers.

Sampling 

The initial data analysis included 597 children tested, including 191 English speaking, 234 French 
speaking, and 112 Spanish speaking children. Of the 597 children, 529 children were tested twice (pre-post). 
Sixty-eight children were available for one, but not both testing opportunities. Many programs test twice to 
provide evidence of effectiveness for various funding requirements. Some of the sample, particularly the 
migrant population, were only available for one testing session. This reduced the Spanish speaking group most 
often. Of the 112 Spanish children, 70 were post tested while 42 left during the year. A small number of
subjects that varied with each process were occasionally dropped from exploratory analyses as extreme or 
misfitting persons. 
The sample was collected in North America from multiple locations and all subjects were enrolled in
pre-kindergarten or kindergarten programs. Many children represented cultures new to the United States or
Canada and therefore the spoken language at home was French or Spanish. All schools used English as the
language of instruction. Children were typically tested at program entrance and exit. The children resided in
Florida, Georgia, Michigan, and Quebec (Canada). The ages of the sample ranged from 48 to 72 months (4 to
6 years old). Specific locations and schools are not identified here, but a variety of urban to rural populations,
levels of SES, and preschool programming models were included. The sample was neither randomly drawn
from nor specific to a defined population, so generalizability is limited, but the functioning of different test
forms within this heterogeneous sample was considered the focus of the investigation.

Since preschool programming varied widely as to curriculum, teacher quality, required age for entrance,
and contact hours, sample gains were considered globally as a normal consequence of the
kindergarten experience, without specific consideration of differential program impacts. Many children
were selected as qualified for targeted remedial programs such as Head Start, but enrollment was controlled by 
the local school requirements. 

The children in the sample were racially and ethnically diverse with 110 black, 115 Hispanic
(representing South American, European, and island nations), 21 Eastern (representing Asian and Pacific island
nations), and 350 white (representing North American, African, European, and South American nations)
children identified by ethnic origin. The sample was composed of 310 female and 284 male subjects. Even
though it was not the question of interest for this study, an initial DPF (Differential Person Functioning) analysis
indicated no contrasts larger than .5 logits between ethnic subpopulations or gender. No contrasts by gender or
origin were significant for total scores.

Results 

Basic Results 

The first conclusion is that the Lollipop Test meets Rasch dimensional requirements for continued
analysis and calibration. All analyses (separate and pooled) were examined with both Winsteps (2003)
Diagnosis/Separation tables and RUMM2020 (Andrich, 2004a) Summary Statistics tables. Since RUMM2020
and Winsteps use slightly different estimation algorithms, the results in side-by-side comparisons can be
confusing. For consistency, Winsteps output is reported except where RUMM2020 reports or
graphics are indicated. A sample summary result, produced with Winsteps, is included in Figure 1 below.
The real person reliability of .91 (separation = 3.24) indicates that the scale discriminates between the persons
well. The real item reliability of 1.00 (separation = 20.43) indicates that the items create a well defined variable.
A person outfit ZSTD mean of .1 (expected value = 0) and S.D. of 1.3 (expected value = 1.0) suggest that the
data fit the model. Our separation table provides evidence that a Rasch analysis is reasonable (Smith, 1999).
In all cases, runs met the guidelines suggested for Winsteps (Smith, 1999) or were rated Excellent on
RUMM2020's Power of Test-of-Fit. Extreme items and persons (misfitting) were typically dropped from some
later subanalyses.
TABLE 3.1 Lollipop Test Analysis Three Languages  ZOU782ws.txt  Mar 16 15:40 2007
INPUT: 597 persons, 104 items  MEASURED: 597 persons, 104 items, 84 CATS  3.57.2

+-----------------------------------------------------------------------------+
|            RAW                         MODEL        INFIT          OUTFIT   |
|          SCORE     COUNT     MEASURE   ERROR     MNSQ   ZSTD    MNSQ   ZSTD |
|-----------------------------------------------------------------------------|
| MEAN      67.8      89.0        5.00     .45     1.30     .9    1.21     .1 |
| S.D.      29.3      23.5        1.98     .16      .48    1.4    1.01    1.3 |
| MAX.     135.0     104.0       13.14    1.85     3.54    5.2    9.90    5.5 |
| MIN.      11.0      50.0       -1.63     .34      .47   -2.5     .04   -2.3 |
|-----------------------------------------------------------------------------|
| REAL  RMSE   .58  ADJ.SD   1.89  SEPARATION  3.24  person RELIABILITY  .91  |
| MODEL RMSE   .48  ADJ.SD   1.92  SEPARATION  4.01  person RELIABILITY  .94  |
| S.E. OF person MEAN = .08                                                   |
+-----------------------------------------------------------------------------+

VALID RESPONSES: 85.6%
person RAW SCORE-TO-MEASURE CORRELATION = .68 (approximate due to missing data)
CRONBACH ALPHA (KR-20) person RAW SCORE RELIABILITY = .96 (approximate due to missing data)

SUMMARY OF 104 MEASURED items
+-----------------------------------------------------------------------------+
|            RAW                         MODEL        INFIT          OUTFIT   |
|          SCORE     COUNT     MEASURE   ERROR     MNSQ   ZSTD    MNSQ   ZSTD |
|-----------------------------------------------------------------------------|
| MEAN     389.1     511.0       12.92     .23      .98   -1.0    1.22   -1.0 |
| S.D.     259.5      16.7        5.27     .12      .30    4.6    1.14    4.1 |
| MAX.    1657.0     529.0       20.62     .85     2.16    9.9    6.58    9.9 |
| MIN.      78.0     492.0        1.79     .07      .51   -9.0     .21   -6.8 |
|-----------------------------------------------------------------------------|
| REAL  RMSE   .26  ADJ.SD   5.26  SEPARATION 20.43  item   RELIABILITY 1.00  |
| MODEL RMSE   .25  ADJ.SD   5.26  SEPARATION 20.64  item   RELIABILITY 1.00  |
| S.E. OF item MEAN = .52                                                     |
+-----------------------------------------------------------------------------+

UMEAN=5.000 USCALE=2.000  item RAW SCORE-TO-MEASURE CORRELATION = -.81



Figure 1. Output of Winsteps Results from Overall Model Analysis 
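
The separation and reliability values printed side by side in Figure 1 are two expressions of the same quantity. The sketch below, with hypothetical function names, applies the standard relationships (separation G is the error-adjusted standard deviation divided by the root mean square error, and reliability is G^2 / (1 + G^2)) to the values reported in the figure. It is offered only as an illustration of how the paired numbers relate, not as a reproduction of the Winsteps computation, and small differences are expected because the printed inputs are rounded.

```python
def separation_from_sd(adjusted_sd: float, rmse: float) -> float:
    """Separation index: 'true' spread of measures in units of measurement error."""
    return adjusted_sd / rmse


def reliability_from_separation(separation: float) -> float:
    """Reliability implied by a separation index G: G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


# Values taken from the person and item summaries in Figure 1.
person_real = reliability_from_separation(3.24)
person_model = reliability_from_separation(4.01)
item_real = reliability_from_separation(20.43)

# Rebuilding the person separation from the rounded ADJ.SD and RMSE
# lands close to, but not exactly on, the printed 3.24.
person_separation = separation_from_sd(1.89, 0.58)
```

Rounded to two decimals, these calls agree with the reliabilities printed in the figure, which is why a separation above 3 and a reliability above .90 are two ways of stating the same result.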



Differential Item Analyses 

Smith (2004) described the detection of item bias with the t-statistic. In a simulation, Smith suggests
parameters for bias detection using reference groups and focal groups of different sizes and varying the proportion
of biased items. Smith concludes, "The Rasch item bias statistics lack the power to detect bias of less than .5
logits unless there are 500 people in each subpopulation. . . .The use of preset item bias cutoffs like +2 and -2 to
detect items that favor one subpopulation over another in pairwise comparisons can be misleading" (pp. 415-416).
Given the complexity of a three-way comparison plus the differing degrees of freedom in each
comparison, a graphic representation is provided here in Figure 2 (pre-test items) and Figure 3 (post-test items)
for exploratory examination, generated by Winsteps / Excel.

On first viewing the pre-test calibration, item 1.07 appears to be unusually easy for English language
students. Item 1.07 is a performance item, "Show me the circle." Item 2.06 appears unusually easy for
Spanish language students. That item refers to a picture of a cat and kittens and asks, "Show me which is the
biggest." On the post-test calibration, a number of items asking for color recognition and counting appeared to
be easier for Spanish speaking students. On both the pre-test and post-test, Identification of Letters and Writing
(subtest four) appeared more difficult for French language students. Keeping in mind the power and Type I
error rate issues raised by Smith (2004), the visually suspect items were narrowed to those with
statistically significant differences (p < .05) and logit contrasts greater than 1.0; these are tabulated in Figure 4.
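
The screening rule just described can be sketched as follows. The function names are hypothetical, and two assumptions are made for illustration: the joint standard error is taken as the square root of the sum of the squared group standard errors, and a critical value of 1.96 (the large-sample two-sided .05 cutoff) stands in for the exact t distribution. The example uses the French and Spanish measures for pre-test item 1.08 as printed in Figure 4; results computed from these rounded values differ slightly from the printed contrast and t.

```python
import math


def dif_contrast(measure_a: float, se_a: float, measure_b: float, se_b: float):
    """Return (contrast, joint standard error, t) for two group difficulties in logits."""
    contrast = measure_a - measure_b
    joint_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return contrast, joint_se, contrast / joint_se


def meets_criteria(contrast: float, t: float,
                   min_logits: float = 1.0, critical_t: float = 1.96) -> bool:
    """Flag a contrast that is both significant and larger than the logit cutoff."""
    return abs(contrast) > min_logits and abs(t) > critical_t


# French (2.77, S.E. .17) versus Spanish (1.39, S.E. .21) on pre-test item 1.08.
contrast, joint_se, t = dif_contrast(2.77, 0.17, 1.39, 0.21)
flagged = meets_criteria(contrast, t)
```

Requiring both conditions reflects the caution quoted from Smith (2004): with unequal and modest group sizes, a significance test alone can flag trivially small contrasts or miss moderate ones.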
E=English, F=French, S=Spanish

Items are labeled by Subtest and Item Number. For example, Pre-test 3.04 = Subtest 3, Item #4. Only the
odd-numbered items are labeled on the graph.



Figure 2. Plot of Pretest Items Comparing English, French, and Spanish Language Versions 

In the plots the DIF measure is the difficulty of the item for each group with all else held constant 
plotted on overlapping lines where the t-value gives the significance of the unit normal deviate (Linacre, 2003). 
Graphically, the French version appears to differ on subtest four (to the right) on the plot. 



person DIF plot (DIF=$S12W1) 




E=English, F=French, S=Spanish Items are labeled by Subtest and Item Number. For example Pre-test 3.04 = Subtest 3, Item #4. 
Only the odd number items are labeled on the graph. 



Figure 3. Plot of Posttest Items Comparing English, French, and Spanish Language Versions 



Figure 3 demonstrates that some items appear easy on subtest one. That is somewhat expected as a 
posttest after instruction and a year's growth. Again, the French test appears to separate on subtest one and
subtest four on the plot.

One item that graphically appears to be extremely different (pre-test 2.06) actually is a case of a very
easy item that is even easier for the Spanish subgroup. With the worst point biserial correlation on the test
(rbis=.14) and a difficulty of -3.45, the graph is misleading as the contrast is large, but not significant. Some
items in subtests one and two have some variability without a clear pattern explaining the variance other than
error. A few post-test items seem to have become very easy for Spanish speaking students, most likely as a
result of targeted instruction. On the pre-test, the English students appear to perform best across subtest four,
while on the post-test the Spanish students are favored. Possible interpretations of the results are addressed in
the discussion section of this article, but the plots indicate enough deviation by language contrast to suggest
additional inspection of contrasts is warranted.

Lang, Chew, Crownover, & Wilkerson
person   DIF      DIF   person   DIF      DIF   DIF       JOINT                          item
CLASS    MEASURE  S.E.  CLASS    MEASURE  S.E.  CONTRAST  S.E.       t    d.f.   Prob.   Number  Name

E          1.71   .17   F          2.77   .17     -1.05    .24    -4.37   414    .0000     15    Pre*1.08
F          2.77   .17   S          1.39   .21      1.37    .27     5.02   339    .0000     15    Pre*1.08
E         -4.22   .10   F         -5.78   .11      1.56    .14      ...   414    .0000     25    Pre*1.13
F         -5.78   .11   S         -4.62   .12     -1.16    .16    -7.17   339    .0000     25    Pre*1.13
E         -2.90   .15   F         -4.50   .08      1.60    .17     9.43   414    .0000     27    Pre*1.14
F          -.57   .18   S          1.57   .21     -2.15    .28    -7.60   339    .0000     33    Pre*2.03
F          1.42   .14   S          2.57   .25     -1.15    .29    -3.98   339    .0001     43    Pre*2.08
E          1.39   .16   F          2.65   .17     -1.26    .24    -5.36   412    ...       77    Pre*4.01
... 2.37 .18 F ...
... 83 Pre*4.04
... 4.21 ...
... 412 ...
... 93 Pre*4.09
... .40 2.91 339 .0039 ...
... Pre*4.09
... 3.38 ...
... .0008 ...
... 16 ...
...                     ...        2.23   .29      1.23    .34     3.67   300    .0003     88    Post4.06
E          2.16   .17   F          3.58   .17     -1.43    .24    -5.86   422    .0000     90    Post4.07
F          3.58   .17   S          2.23   .29      1.35    .34     4.00   300    .0001     90    Post4.07
E          2.10   .17   F          3.35   .17     -1.25    .24    -5.27   422    .0000     92    Post4.08
F          3.35   .17   S          1.88   .31      1.47    .35     4.23   300    .0000     92    Post4.08
E          2.18   .17   F          4.23   .21     -2.04    .27    -7.59   422    .0000     94    Post4.09
F          4.23   .21   S          2.40   .28      1.83    .35     5.20   300    .0000     94    Post4.09
E          2.60   .16   F          3.74   .18     -1.14    .24    -4.65   422    .0000     96    Post4.10
F          3.74   .18   S          2.23   .29      1.51    .34     4.41   300    .0000     96    Post4.10
F          2.66   .15   S          1.58   .33      1.08    .36     3.01   300    .0028     98    Post4.11
E          1.93   .17   F          3.17   .16     -1.24    .24    -5.25   423    .0000    100    Post4.12
F          3.17   .16   S          1.68   .32      1.48    .36     4.17   300    .0000    100    Post4.12
E          2.19   .17   F          3.67   .18     -1.48    .25    -6.03   423    .0000    102    Post4.13
F          3.67   .18   S          1.79   .31      1.88    .36     5.22   299    .0000    102    Post4.13



E=English, F=French, S=Spanish. Contrasting items (p < .05) with a magnitude greater than 1 logit. 

Grey indicates the item contrasted on both the pretest and the posttest. Yellow is the only non-French contrast identified. 



Figure 4. Winsteps output of items with significant DIF (p < .05) and a contrast larger than 1 logit. 
Figure 4 contains the DIF analysis by Winsteps for the Pretest and Posttest items identified as significant
(p < .05) contrasts by language with a magnitude greater than one logit. The grey bars indicate items which
show the parallel contrast on both the Pretest and Posttest. Only one item appears to show
an English-Spanish contrast. With 103 contrasts, it is possible that some appear significant due to Type I error.
Figure 5 illustrates the Winsteps DIF analysis by subtests for both pretest and posttest analyses where the
contrast was significant and the magnitude of difference was greater than .25 logits. Again, all the subtest
contrasts that met these admittedly arbitrary criteria involved the French language test.
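
The Type I error concern raised above can be made concrete with a short calculation. The sketch below, using hypothetical function names, takes the 103 contrasts and the .05 level from the text. The familywise figure additionally assumes the tests are independent, which is unlikely to hold for contrasts that share items and groups, so it should be read only as a rough upper-level illustration. The Bonferroni threshold is shown as one common adjustment and is not a procedure reported by the authors.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Number of contrasts expected to reach significance by chance alone."""
    return n_tests * alpha


def bonferroni_alpha(n_tests: int, alpha: float) -> float:
    """Per-test significance level that holds the familywise rate near alpha."""
    return alpha / n_tests


def familywise_error(n_tests: int, alpha: float) -> float:
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_contrasts = 103
chance_hits = expected_false_positives(n_contrasts, 0.05)   # about five contrasts
strict_alpha = bonferroni_alpha(n_contrasts, 0.05)          # roughly .0005
any_false_hit = familywise_error(n_contrasts, 0.05)
```

Most rows in Figure 4 report probabilities printed as .0000, so they would remain significant under the stricter threshold, whereas rows with probabilities such as .0039 or .0028 would not.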



TABLE 33.1 Lollipop Test Analysis Three Languages  ZOU949ws.txt  Apr 3 1:46 2007
INPUT: 596 persons, 104 items  MEASURED: 529 persons, 52 items, 45 CATS  3.57.2

CLASS-LEVEL BIAS / INTERACTIONS FOR DIF=$S12W1 AND DPF=$S5W1: PRETEST

person   DIF      DIF   person   DIF      DIF   DIF       JOINT                          item
CLASS    MEASURE  S.E.  CLASS    MEASURE  S.E.  CONTRAST  S.E.       t    d.f.   Prob.   CLASS

E           .21   .04   F          -.21   .04       .41    .05    -7.60   INF    .0000     1
F          -.21   .04   S           .09   .05      -.30    .06     4.67   INF    .0000     1
F          -.12   .03   S           .16   .05      -.28    .06     4.54   INF    .0000     2
F           .09   .03   S          -.18   .05       .27    .06    -4.50   INF    .0000     3
E          -.27   .04   F           .26   .04      -.52    .05     9.54   INF    .0000     4
F           .26   .04   S          -.07   .06       .32    .07    -4.75   INF    .0000     4

TABLE 33.1 Lollipop Test Analysis Three Languages  ZOU015ws.txt  Apr 3 9:12 2007
INPUT: 596 persons, 104 items  MEASURED: 596 persons, 104 items, 84 CATS  3.57.2

CLASS-LEVEL BIAS / INTERACTIONS FOR DIF=$S12W1 AND DPF=$S5W1: POSTTEST

person   DIF      DIF   person   DIF      DIF   DIF       JOINT                          item
CLASS    MEASURE  S.E.  CLASS    MEASURE  S.E.  CONTRAST  S.E.       t    d.f.   Prob.   CLASS

E           .14   .03   F          -.18   .03       .33    .04    -8.13   INF    .0000     1
F          -.18   .03   S           .20   .04      -.38    .05     7.72   INF    .0000     1
E           .18   .03   F          -.22   .02       .40    .04    -11.1   INF    .0000     2
F          -.22   .02   S           .23   .04      -.45    .05    10.00   INF    .0000     2
E          -.31   .03   F           .40   .03      -.71    .04    19.63   INF    .0000     4
F           .40   .03   S          -.28   .04       .68    .05    -14.9   INF    .0000     4



E=English, F=French, S=Spanish. Grey indicates the item contrasted on both the pretest and the posttest. 



Figure 5. Winsteps output of subtest scores (1-4) with significant DIF (p < .05); a contrast > .25 logits. 



Finally, a subset of items was investigated in order to observe item and subtest functioning graphically
with expected Item Characteristic Curves (ICCs). RUMM2020 was chosen for this analysis.
The results here include examples that illustrate the process, as space does not allow comprehensive output.
Figures 6 and 7 plot the observed ICC confidence interval means for pretest item 3.14, which
appears to model an expected Rasch ICC reasonably well, followed by a contrast plot of the language
functioning where no significant difference was detected by the contrasts.
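
The kind of comparison these plots make can be sketched in a few lines: persons are sorted by ability, grouped into intervals, and the observed mean score in each interval is set against the probability the Rasch curve expects at the interval's mean ability. The function names and the nine-person data set below are invented for illustration, the item is treated as dichotomous, and the grouping into equal-sized intervals is an assumption rather than a description of how RUMM2020 forms its intervals.

```python
import math


def expected_icc(ability: float, location: float) -> float:
    """Expected score on a dichotomous Rasch item located at `location` logits."""
    return 1.0 / (1.0 + math.exp(-(ability - location)))


def class_interval_means(abilities, scores, n_intervals):
    """Group persons by ability and return (mean ability, observed mean score) pairs."""
    paired = sorted(zip(abilities, scores))
    size = len(paired) // n_intervals
    means = []
    for k in range(n_intervals):
        start = k * size
        # The last interval absorbs any persons left over by integer division.
        end = (k + 1) * size if k < n_intervals - 1 else len(paired)
        group = paired[start:end]
        mean_ability = sum(a for a, _ in group) / len(group)
        mean_score = sum(s for _, s in group) / len(group)
        means.append((mean_ability, mean_score))
    return means


# Hypothetical abilities (logits) and right/wrong scores for one item.
abilities = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
scores = [0, 0, 1, 0, 1, 1, 1, 1, 1]

intervals = class_interval_means(abilities, scores, 3)
# Positive residuals mean the interval scored higher than the model expects.
residuals = [observed - expected_icc(theta, 0.0) for theta, observed in intervals]
```

Running the same grouping separately for each language group and overlaying the results gives the type of contrast plot shown in the language figures: parallel, overlapping points suggest no DIF, while systematically displaced points suggest an item that is easier or harder for one group.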
[Plot: I0075 Descriptor for Item 75. Locn = 0.639, Spread = -4.728, FitRes = -0.402, ChiSq[Pr] = 0.941, F[Pr] = 0.926.]




Figure 6. Observed function plot of item 14 from subtest 3 administered as a pretest (RUMM2020). 



[Plot: Descriptor for Item 75 (I0075), 3 levels for person factor LANGUAGE. x-axis: Person Location (logits). Slope = 1.00. Legend: x = English, o = Spanish, + = French.]



Figure 7. Language contrast plot of item 14 from subtest 3 administered as a pretest (RUMM2020). 

Now we can observe an item that does not appear to behave well as an expected Rasch ICC. Figure 8
is a plot of item 3 from subtest 2 (2.03 in Figure 4), whose observed CI means do not appear to have appropriate
discrimination. In Figure 9, the same item is plotted showing the language DIF contrasts. This pretest item was
originally identified statistically and is now confirmed graphically.
[Plot: I0033 Descriptor for Item 33. Locn = 0.041, FitRes = 5.408, ChiSq[Pr] = 0.000, F[Pr] = 0.000. x-axis: Person Location (logits).]



Figure 8. Observed plot of item 3 from subtest 2 administered as a pretest (RUMM2020). 



[Plot: Descriptor for Item 33 (I0033), 3 levels for person factor LANGUAGE. x-axis: Person Location (logits). Slope = 0.25. Legend: x = English, o = Spanish, + = French.]



Figure 9. Language contrast plot of item 3 from subtest 2 administered as a pretest (RUMM2020). 

It was also confirmed that the time of testing may be important to item functioning on this preschool
readiness test. Originally, this was a logical assertion of the authors, but Figures 10 and 11 plot the
pretest and posttest functions for item 1.12. The DIF is still present, but the direction of the bias has
reversed. This item met our statistical criteria on the posttest analysis in Figure 4, but did not have a large
enough contrast to meet the criteria for the pretest analysis. Graphically, this is clearly an item of interest.
[Plot: Descriptor for Item 23 (I0023), 3 levels for person factor LANGUAGE. x-axis: Person Location (logits). Slope = 1.00. Legend: x = English, o = Spanish, + = French.]



Figure 10. Language contrast plot of item 12 from subtest 1 administered as a pretest (RUMM2020). 



[Plot: Descriptor for Item 24 (I0024), 3 levels for person factor LANGUAGE. x-axis: Person Location (logits). Slope = 1.00. Legend: x = English, o = Spanish, + = French.]



Figure 11. Language contrast plot of item 12 from subtest 1 administered as a posttest (RUMM2020). 

Finally, analysis of the subtest scores was performed. It is common to report the Lollipop Test by
subtest scores, as the original design of the test separates items into subtests that reflect item types and
increasing difficulty. Similar to the individual item graphs, Figures 12 and 13 demonstrate the
observed ICC of subtest two given as a posttest, and subtest four given as a posttest. These reveal performance
differences contrasting by language. This is consistent with our original Figures 2 and 3. It is also consistent
with the larger number of individual items expressing DIF in subtest four as opposed to subtest two.



[Plot: SubTest 4 (ST04). Locn = -0.258, Spread = 0.068, FitRes = 1.108, ChiSq[Pr] = 0.684, SampleN = 590. Axes: Person Location (logits) vs. Expected Value.]

[Plot: SubTest 4 (ST04), 3 levels for person factor LANGUAGE. Horizontal axis: Person Location (logits). Slope = 8.09. Legend: X = English, o = Spanish, + = French.]



Figure 12. ICC and language contrast plots of subtest 2 administered as a posttest (RUMM2020).

[Plot: SubTest 8 (ST08). Locn = 0.177, Spread = -0.002, FitRes = 1.123, ChiSq[Pr] = 0.437, SampleN = 590. Axes: Person Location (logits) vs. Expected Value.]

[Plot: SubTest 8 (ST08), 3 levels for person factor LANGUAGE. Horizontal axis: Person Location (logits). Slope = 26.87. Legend: X = English, o = Spanish, + = French.]



Figure 13. ICC and language contrast plots of subtest 4 administered as a posttest (RUMM2020).

Discussion

For purposes of discussion, some items that appeared to demonstrate the most DIF across language form 
and both testing times (pre-post) were examined first. These are reproduced in Table 3 below. 

Table 3

Items on the Trilingual Lollipop that demonstrated DIF.

Subtest                 Item    Item                                            Language Contrast
                        Number                                                  Pre-test        Post-test

1: Colors & Shapes      7       "Show me the circle."                           E-E, E-S, E-S   E-E, E-S
1: Colors & Shapes      8       "Show me the rectangle."                        E-E, E-S        E-E
1: Colors & Shapes      13      "Draw a cross just like this one."              E-E, E-S        E-S
1: Colors & Shapes      14      "Draw a square just like this one."             E-E             E-E, E-S
2: Position & Spatial   3       "Show me (point to) the kitty that is on top."  E-E, E-S        E-E, E-S
2: Position & Spatial   8       "Which one is first?"                           E-S             E-E, E-S
4: Letters & Writing    1       "Show me the letter B."                         E-E             E-E, E-S
4: Letters & Writing    4       "Show me the letter P."                         E-E             E-E, E-S
4: Letters & Writing    9       "What letter is this? (D)"                      E-S             E-E, E-S
4: Letters & Writing    10      "What letter is this? (H)"                      E-E             E-E, E-S
4: Letters & Writing    12      "Write the letter B."                           E-E, E-S        E-E, E-S

E=English, S=Spanish, F=French



Our purpose in this discussion is not to exhaust all items and possible explanations, but to describe how 
construct knowledge combined with statistical evidence might be used to explain DIF and provide reasonable 
targets for instrument improvement. 

First, in both the statistical and graphical analyses, there was a clear trend for the French language
version to function as more difficult on the posttest in subtest four, which targets letter identification. The
stimulus cards were the same for all languages, and the authors' explanation for the contrast is that European /
French handwriting and letter formation differ from American / English. A possible cultural difference in
handwriting instruction may confound the subtest four items. For example, an overview of the French
American International School in San Francisco (online at http://frenchamericansf.org ) provides this
description of early instruction:

"Students learn the French cursive method of writing. They begin by learning only lower case letters
in first grade, spending time in the French class learning to correctly and legibly reproduce the letters.
Children use this method for all their writing activities. Handwriting instruction continues in both
French and English in second grade."

If French posttest children have been taught cursive as opposed to print or D'Nealian letter formation, and have
not had the same exposure to uppercase letters, they may simply fail to recognize the capital block letters on
the stimulus cards as readily as their Spanish and English speaking counterparts.

The next most interesting DIF concerns both the pretest and posttest items calling for recognition of
shapes or drawing of simple shapes (circle, cross, square). Again, the French speaking children appeared to
differ from Spanish and English test takers. No curricular or cultural differences were apparent despite
consideration of religion (crosses), playtime activities (games that used shapes), or television (Sesame Street)
that may have provided cultural loading for the items. The best explanation seems to be that the reproduction of
the French test contained figures and drawing space that were slightly smaller in the vertical direction
(estimated at 5 to 10%). Even though drawing area is not known to be a consideration and went unnoticed until
this investigation of DIF, it may have had an inadvertent effect on item functioning, and this hypothesis needs
to be investigated in the future.

Another DIF finding for the French language items involved a subset of items based on a stimulus card
with a picture of a "mother cat" and "kittens". These items ask questions of spatial relations, such as "Show me
the kitty that is on top." Even though there is no clear reason for the DIF, there is some speculation among the
authors that the words for cat and kitty in French (petits chats, chats, chatons) differ in difficulty from those in
English (cat, kitty) or Spanish (gatitos, gatos) simply because the blends and diphthongs in the French words
are more complex for young language learners. Another theory is that French children are hesitant to point to
the picture (pointing is rude). These are only educated guesses at this point, but they seem like viable
construct explanations that warrant inquiry.

The discussion here is a sample of a larger set of interpretations that demonstrate the use of Rasch DIF 
analysis and construct understanding that would lead to test improvements in the language translations of the 
Lollipop Test. 

Conclusions and Recommendations 

As more multilingual / multicultural tests are used in schools with growing international populations,
form equivalence becomes important, especially for placement or high-stakes decisions. Analyses that are
sensitive to item differences are difficult within populations where the total score is confounded by many
demographic and curricular factors. Simply translating from one language to another may not produce
construct measures that are similar enough to be acceptable. The translations from English to French and from
English to Spanish had demonstrated concurrent validity and predictive validity in longitudinal studies (Lang,
Chew, & Schomber, 1992; Venet et al., 2003), but had not been examined for DIF in comparative settings.
One reason that such studies may be unpopular is simply the sample size necessary to detect item bias; even in
this study the number of subjects is adequate, but not as large as desirable. In this case, a conclusion is that the
English and Spanish editions of the Lollipop Test appear reasonably equivalent, but the French form appears to
have significant bias on some items, which warrants item and translation review. Rasch model DIF analysis
exposed contrasts that other research had missed.

Another conclusion of the authors is that DIF analysis using the Rasch model revealed a number of
possible item discrepancies that were hidden in previous classical studies incorporating factor analysis,
regression line comparisons, discriminant analysis, and canonical correlations. Rasch analysis provides a
powerful tool for detecting DIF in item responses across different language translations of tests. Even though
the burden of item writing and construct consideration is always part of test authorship, an analysis that
pinpoints potential issues and provides clues to the questions at hand is welcome indeed.
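
To make the idea of a language contrast concrete, the following minimal Python sketch evaluates the
dichotomous Rasch model for one item whose location differs by language group. It is an illustration only and
is not the RUMM2020 procedure used in this study. The function names and the item locations are hypothetical
and are not estimates from the Lollipop data.

```python
import math


def rasch_probability(theta, b):
    """Dichotomous Rasch model: P(correct) for a person at ability theta
    (logits) on an item with location b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def dif_contrast(b_focal, b_reference):
    """DIF contrast in logits: how much harder (positive) or easier
    (negative) the item is for the focal group than the reference group."""
    return b_focal - b_reference


# Hypothetical group-specific locations for a single item (NOT study results).
item_locations = {"ENGLISH": -0.20, "SPANISH": -0.10, "FRENCH": 0.60}

theta = 0.0  # a person at the centre of the scale
for language, b in item_locations.items():
    p = rasch_probability(theta, b)
    print(f"{language:8s} location = {b:+.2f}  P(correct | theta = 0) = {p:.3f}")

contrast = dif_contrast(item_locations["FRENCH"], item_locations["ENGLISH"])
print(f"French - English contrast = {contrast:+.2f} logits")
```

Under the model, children of equal ability should have the same probability of success whatever the test
language. A nonzero contrast in item location means that they do not, which is what the language contrast plots
display graphically.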

A possible advantage in the use of Rasch software such as RUMM2020 is the split item function, which
allows a biased item to be scored differently for different groups of test takers. Whether or not a given
measurement philosophy or application considers that method appropriate, it is certainly another useful tool to
be able to equate total scores or subtest scores by simply creating a separate item for the biased group and
scoring it differently (Andrich, 2004b, pp. 31-39). The Amend Sample Size function in RUMM2020, which
allows a "prophecy formula" for Rasch analysis when large sample sizes are not easily obtainable, is also an
interesting and useful tool. A recommendation would be to explore whether these software functions are able
to provide adjusted equating of a limited amount of item or subtest bias and remain true to the construct.
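
The idea behind splitting an item can be sketched in a few lines of Python. This is a hedged illustration of the
data restructuring only: it is not RUMM2020 code, and the function name, item labels, and responses below are
hypothetical. The biased item is replaced by one group-specific copy per language, with missing responses
(None) for persons outside that group, so that each copy can receive its own location when the data are
reanalyzed.

```python
def split_item_by_group(responses, groups, item):
    """Replace `item` with one group-specific item per group label.

    responses: list of dicts mapping item label -> score, one per person
    groups:    list of group labels, one per person (same order)
    item:      label of the item to split
    """
    labels = sorted(set(groups))
    split_rows = []
    for row, group in zip(responses, groups):
        # Keep every other item unchanged.
        new_row = {name: score for name, score in row.items() if name != item}
        # A person answers only the copy belonging to their own group.
        for label in labels:
            new_row[f"{item}_{label}"] = row[item] if group == label else None
        split_rows.append(new_row)
    return split_rows


# Hypothetical responses for three children (NOT study data).
responses = [
    {"I0033": 1, "I0034": 0},
    {"I0033": 0, "I0034": 1},
    {"I0033": 1, "I0034": 1},
]
groups = ["ENGLISH", "FRENCH", "SPANISH"]

for row in split_item_by_group(responses, groups, "I0033"):
    print(row)
```

After the split, the unaffected items still link the groups to a common scale, while the split item no longer
forces one location on all three language forms.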

Even though conclusions based on this sample size, and on an analysis with no common people or
items, would require further investigation after some test corrections, this exercise was revealing to the
authors and convincing that the French version of the Lollipop Test was not performing equivalently to the
Spanish and English versions. This was completely unknown prior to this study and gives the authors another
challenge for the future.